01 - Prediction
Pre-Class Code Assignment Instructions
This semester, I am going to ask you to do a fair bit of work before coming to class. This will make our class time shorter, more manageable, and hopefully less boring.
I am also going to use this as an opportunity for you to directly earn grade points for your effort/labor, rather than “getting things right” on an exam.
Therefore, I will ask you to work through the posted slides on Canvas before class. Throughout the document, I will post Pre-Class Questions for you to work through in R. These will look like this:
Pre-Class Q1
In R, please write code that will read in the .csv from Canvas called sf_listings_2312.csv. Assign this the name bnb.
You will then write your answer in a .r script:
# Q1
bnb <- read.csv("sf_listings_2312.csv")
Important:
To earn full points, you need to organize your code correctly. Specifically, you need to:
- Answer questions in order.
  - If you answer them out of order, just re-arrange the code after.
- Preface each answer with a comment (# Q1/# Q2/# Q3) that indicates exactly which question you are answering.
  - Please just write the letter Q and the number in this comment.
- Make sure your code runs on its own, on anyone's computer.
  - To do this, I would always include rm(list = ls()) at the top of every .r script. This will clear everything from the environment, allowing you to check that your script runs on its own (including on my computer).
Handing this in:
- You must submit this to Canvas before 9:00am on the day of class. Even if class starts at 10:00am that day, these are always due at 9:00.
- You must submit this code as a .txt file. This is because Canvas cannot present .R files to me in SpeedGrader.
- To save as .txt:
  - Click File -> New File -> Text File
  - Copy and paste your completed code into that new text file.
  - Save the file as firstname_lastname_module.txt
    - For example, my file for Module 01 would be matt_meister_01.txt
    - My file for Module 05 would be matt_meister_05.txt
Grading:
- I will grade these for completion.
  - You will receive 1 point for every question you give an honest attempt to answer.
  - Your grade will be the number of questions you answer, divided by the total number of questions.
- This is why it is important that you number each answer with # Q1, # Q2, etc.
  - Any questions that are not numbered this way will be graded incomplete, because I can't find them.
- You will receive a 25% penalty for submitting these late.
- I will post my solutions after class.
Prediction
Load in these packages. If you do not have them, you will need to install them.
- e.g., install.packages("tidytext")
library(tidytext)
library(stringr)
library(dplyr)
library(ggplot2)
And read in these data:
bnb <- read.csv("sf_listings_2312.csv")
Why are we doing this?
The goal of data analysis/science is to make predictions about the future.
We do not just analyze data to find some neat facts about Airbnb properties (for example). We analyze data to understand how the world works. Because that will give us our best guess about how the world will work in the future.
And we have some control over the future! The things we discover about the past are helpful only because they can inform the choices we make for the future.
For example:
One of the common praises of Airbnbs is that they are often larger than hotels, allowing friends to stay together. If it is true (that staying together is a benefit of Airbnbs), we should find that Airbnbs that accommodate more people get higher ratings than Airbnbs that accommodate fewer people.
Pre-Class Q2
In R, test with linear regression whether listings that accommodate more people receive higher ratings.
Note: The variable for average rating is called review_scores_rating.
summary(lm(data = bnb, review_scores_rating ~ accommodates))
Call:
lm(formula = review_scores_rating ~ accommodates, data = bnb)
Residuals:
Min 1Q Median 3Q Max
-3.8139 -0.0364 0.1267 0.2130 0.2936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.679486 0.010714 436.767 <2e-16 ***
accommodates 0.026885 0.002819 9.537 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4394 on 6173 degrees of freedom
(1881 observations deleted due to missingness)
Multiple R-squared: 0.01452, Adjusted R-squared: 0.01436
F-statistic: 90.95 on 1 and 6173 DF, p-value: < 2.2e-16
Pre-Class Q3
Do listings that accommodate more people receive higher ratings? Answer this in a comment.
# Yes. The coefficient is > 0, and the p-value is very low. This suggests that places that accommodate more people receive significantly higher ratings.
This result tells us something about the future because the p-value indicates that the relationship between review_scores_rating and accommodates is stronger than we would expect from chance alone. Therefore, it is not due to randomness that larger places received higher ratings.
One condition that we have to satisfy in order to use past data to predict the future is that the results we observe must not be due to chance. Because chance is random. This is easy to test by looking at p-values. We have just done this for a single variable (accommodates), but we can also do it for multiple variables.
Model Comparison
The world is a complicated place. While we often care about individual variables, it is more common that we want to create (semi-) complex models, which use multiple variables to predict an outcome. To know whether adding this complexity is worth it, or which of a series of models will best predict the future, we have to compare those models.
We will cover two forms of model comparison. First, using anova(). Second, splitting our data into training and testing sets.
Anova
Anova is an acronym for analysis of variance. It takes two nested models, and tests whether the additional variance explained by the more complex one is larger than we would expect by chance.
Advantages:
- Relatively simple to code and run
- Requires less data than training/testing sets
- Provides a p-value, which is familiar
Disadvantages:
- Not very common anymore (because we have more data than we used to)
- Cannot compare non-nested models, such as models of equal complexity that use different variables
  - e.g., cannot compare rating ~ price to rating ~ accommodates
That second disadvantage is pretty substantial. As a result, you won't see this as much.
If I wanted to see whether a model with accommodates and price predicted review_scores_rating better than a model with accommodates alone, I could do the following.
First, let’s look at the variables we have:
hist(bnb$review_scores_rating)
hist(bnb$accommodates)
hist(bnb$price)
Error in hist.default(bnb$price): 'x' must be numeric
Whoa! price threw an error. This is because it is not numeric. It has that annoying $ before it, and also a comma when prices are in the thousands. We'll need to remove those with the following code:
Pre-Class Q4
Just copy this into R.
# First, remove the dollar sign
bnb$price <- str_remove(bnb$price, "\\$")
# You need the "\\" because $ is a special character in regular expressions, but we just want the literal symbol gone.
# Second, remove the comma
bnb$price <- str_remove(bnb$price, ",")
# Third, make this a number
bnb$price <- as.numeric(bnb$price)
Pre-Class Q5
See whether a model with accommodates and price predicts review_scores_rating better than a model with accommodates alone.
model_simple <- lm(data = bnb, review_scores_rating ~ accommodates)
model_complex <- lm(data = bnb, review_scores_rating ~ accommodates + price)
anova(model_simple, model_complex)
Analysis of Variance Table
Model 1: review_scores_rating ~ accommodates
Model 2: review_scores_rating ~ accommodates + price
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6173 1192.0
2 6172 1185.8 1 6.181 32.171 1.476e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
That p-value being extremely small tells me that the variance reduced by the more complex model was well worth the added complexity. This should make us confident that using the complex model will lead to better predictions in the future than using the simpler model.
Train/Test Splits
Another way we can do this model comparison is by splitting our data into two sets (training and testing), and seeing how well each model that we fit to the training set predicts the testing set.
Advantages:
- Super common (especially in machine learning)
- Very flexible: you can use different variables, models, etc.
- Rather than a p-value, we can directly compare the error between models
- This is because the p-value is approximating what we are actually doing here
Disadvantages:
- Needs more data to be reliable
- More complex to code
First: split our data into a training and a testing set
- Usually, the training set is 70%-80% of the data.
- We do this by randomly sampling our data with sample().
  - We should randomly sample the id, as this is unique to each row.
- Because there is randomness, we will use set.seed() to get consistent results.
Pre-Class Q6
Create splits of bnb into training and testing data.
set.seed(28)
samp_size <- round(.7 * nrow(bnb)) # I am rounding this to get a whole number
train <- sample(bnb$id, size = samp_size, replace = FALSE)
bnb_train <- bnb[bnb$id %in% train,] # Index bnb where the id is inside train
bnb_test <- bnb[!(bnb$id %in% train),] # Index bnb where the id is NOT inside train
Second: train each model on the training set
Pre-Class Q7
model_simple_train <- lm(data = bnb_train, review_scores_rating ~ accommodates)
model_complex_train <- lm(data = bnb_train, review_scores_rating ~ accommodates + price)
Note: You need to use bnb_train as the data, not bnb. Also, if you look at model_simple and model_simple_train, you will see the coefficients are slightly different. Think about why this is.
Third: test each model on the testing set.
Pre-Class Q8
Do this by:
- Making predictions about review_scores_rating from each model.
- Calculating the squared error for each model.
- Comparing those numbers.
#a) predictions
bnb_test$prediction_simple <- predict(object = model_simple_train,
newdata = bnb_test,
type = "response")
bnb_test$prediction_complex <- predict(object = model_complex_train,
newdata = bnb_test,
type = "response")
#b) calculate error
bnb_test$error_simple <- (bnb_test$prediction_simple - bnb_test$review_scores_rating)^2
bnb_test$error_complex <- (bnb_test$prediction_complex - bnb_test$review_scores_rating)^2
#c) compare error
mean(bnb_test$error_simple, na.rm = T)
[1] 0.2002133
mean(bnb_test$error_complex, na.rm = T)
[1] 0.1999402
predict() spits out a prediction for each row in the new data when we specify type = "response".
- This is why we can just attach it as a column.
In this case, because we are actually making predictions, we do not need p-values. We can just look at the reduction in error itself.
One thing you will notice is that we barely reduced any error from the average prediction. If that seems odd, you can try different things:
- Cross-validation (described here)
  - This repeats the process we just did (splitting, training, testing) multiple times on the same set of data, potentially reducing the impact of randomness in our sampling.
- Variable selection
  - price is not normally distributed at all, so we may be getting some weirdness in the results. You could see how results change if you transform price somehow.
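The repeated splitting described above can be sketched as follows. This is a minimal sketch, assuming the bnb data from earlier is loaded and price has been cleaned; the choice of 10 repeats and the variable names (mse_simple, cv_train, etc.) are arbitrary:

```r
set.seed(28)
n_repeats <- 10
mse_simple <- numeric(n_repeats)
mse_complex <- numeric(n_repeats)

for (i in 1:n_repeats) {
  # A fresh random split on every repeat
  train_ids <- sample(bnb$id, size = round(.7 * nrow(bnb)), replace = FALSE)
  cv_train <- bnb[bnb$id %in% train_ids, ]
  cv_test <- bnb[!(bnb$id %in% train_ids), ]

  # Re-fit both models on this repeat's training set
  m_simple <- lm(data = cv_train, review_scores_rating ~ accommodates)
  m_complex <- lm(data = cv_train, review_scores_rating ~ accommodates + price)

  # Record each model's mean squared error on this repeat's testing set
  mse_simple[i] <- mean((predict(m_simple, newdata = cv_test) -
                           cv_test$review_scores_rating)^2, na.rm = TRUE)
  mse_complex[i] <- mean((predict(m_complex, newdata = cv_test) -
                            cv_test$review_scores_rating)^2, na.rm = TRUE)
}

# Averaging the error across repeats is less sensitive to any single lucky
# (or unlucky) random split than the one split we did above
mean(mse_simple)
mean(mse_complex)
```

Comparing mean(mse_simple) to mean(mse_complex) then tells you which model predicted better on average, rather than on one particular split.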
Actually Making Predictions
The simple model (from the anova() example) predicts review_scores_rating as the following function:
review_scores_rating\(_i\) = 4.68 + 0.027 \(\times\) accommodates\(_i\)
- In this function, \(i\) denotes any individual listing.
The complex model predicts review_scores_rating as the following function:
review_scores_rating\(_i\) = 4.68 + 0.027 \(\times\) accommodates\(_i\) − 0.000016 \(\times\) price\(_i\)
Why does this matter?
With the simple model, we predict that the cheapest and the most expensive 4-person Airbnb would have the exact same ratings.
Pre-Class Q9
See these predictions with predict(). To use predict(), we need to create a new data.frame that has the exact same variables as the variables in our models.
# Minimum price for a four-person bnb
min_fourperson <- min(bnb[bnb$accommodates == 4, ]$price)
# Maximum price for a four-person bnb
max_fourperson <- max(bnb[bnb$accommodates == 4, ]$price)
new_data <- data.frame(
accommodates = c(4, 4),
price = c(min_fourperson, max_fourperson)
)
predict(object = model_simple, #Predict takes an "object" (a model)
newdata = new_data, # and it takes "newdata"
type = 'response')
       1        2
4.787026 4.787026
Pre-Class Q10
With the complex model, we do not predict that the cheapest and the most expensive 4-person Airbnb would have the exact same ratings. On your own, I want you to make those predictions for the complex model.
Making variables behave
Now, if you look at the distribution of price, you will see that it is a bit… crazy. For example, the maximum price of a four-person Airbnb is $50,000 per night!
You will probably get better predictions if you enforce some normality on price. On your own, I would like you to:
Pre-Class Q11
- Create a new variable called bnb$log_price, which will be the natural log of bnb$price (Google this).
- Create a new model called model_complex_log, using log_price in place of price.
- Use anova() to compare the simple model to this new complex model. Is it still better?
Final Questions
These questions are for added points/difficulty. They count towards the total score.
Pre-Class Q12
You are not going to be able to test the original model_complex against model_complex_log using anova(), because those two models are not nested. But, you can test these two using the training/testing example from earlier. Try to do that here.
Pre-Class Q13
I would like you to challenge yourself here, by attempting to test a totally new set of models. Predict review_scores_rating with new variables, using both anova() and training/testing splits.
In Class
In class, we are going to try out making predictions of review_scores_rating using whatever variables we want. You will then hand in your best models as your homework.
If you just use the variables that are already in the data set, you will struggle. Instead, I suggest creating new ones, cleaning the ones you have, etc.
For example, you might want to distinguish places that accommodate more than four people categorically:
bnb$accommodates_5plus <- ifelse(bnb$accommodates > 4, 1, 0)
And then see if this predicts ratings:
model_5plus <- lm(data = bnb, review_scores_rating ~ accommodates_5plus)
summary(model_5plus)
Call:
lm(formula = review_scores_rating ~ accommodates_5plus, data = bnb)
Residuals:
Min 1Q Median 3Q Max
-3.8582 -0.0361 0.1339 0.2439 0.2539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.746147 0.006202 765.315 < 2e-16 ***
accommodates_5plus 0.112090 0.014503 7.728 1.26e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4405 on 6173 degrees of freedom
(1881 observations deleted due to missingness)
Multiple R-squared: 0.009583, Adjusted R-squared: 0.009423
F-statistic: 59.73 on 1 and 6173 DF, p-value: 1.262e-14
You might also want to pull out some specific amenity. Any time you are handling text, make it lower case!
bnb$amenities <- tolower(bnb$amenities)
bnb$kitchen <- ifelse(str_detect(bnb$amenities, 'kitchen'), 1, 0)
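You could then check whether having a kitchen predicts ratings, the same way we did for accommodates_5plus above. A quick sketch, assuming the kitchen column was created as shown:

```r
# Regress ratings on the kitchen indicator we just created
model_kitchen <- lm(data = bnb, review_scores_rating ~ kitchen)
summary(model_kitchen)
```

The coefficient on kitchen would then tell you the predicted difference in average rating between listings with and without a kitchen.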